Recognizing Multi-talker Speech with Permutation Invariant Training
In this paper, we propose a novel technique for direct recognition of
multiple speech streams given a single channel of mixed speech, without first
separating them. Our technique is based on permutation invariant training (PIT)
for automatic speech recognition (ASR). In PIT-ASR, we compute the average
cross entropy (CE) over all frames in the whole utterance for each possible
output-target assignment, pick the one with the minimum CE, and optimize for
that assignment. PIT-ASR forces all the frames of the same speaker to be
aligned with the same output layer. This strategy elegantly solves the label
permutation problem and speaker tracing problem in one shot. Our experiments on
artificially mixed AMI data showed that the proposed approach is very
promising.
Comment: 5 pages, 6 figures, InterSpeech201
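The utterance-level PIT criterion described above lends itself to a compact illustration. Below is a minimal sketch in PyTorch, assuming one output layer per speaker and frame-aligned target streams; the tensor shapes and names are illustrative, not the authors' implementation.

    # Minimal sketch of utterance-level PIT cross entropy (illustrative names).
    from itertools import permutations

    import torch
    import torch.nn.functional as F

    def pit_cross_entropy(logits, targets):
        """logits: (num_speakers, frames, classes), one output layer per speaker.
        targets: (num_speakers, frames), one label stream per speaker.
        Returns the minimum utterance-level average CE over all assignments."""
        num_speakers = logits.shape[0]
        best = None
        for perm in permutations(range(num_speakers)):
            # Average CE over all frames of the whole utterance for this
            # output-to-target assignment, so each speaker stays on one output layer.
            ce = sum(
                F.cross_entropy(logits[out], targets[tgt])
                for out, tgt in enumerate(perm)
            ) / num_speakers
            best = ce if best is None or ce < best else best
        return best  # optimize the assignment with minimum CE

Because the assignment is scored over the whole utterance rather than per frame, the minimizing permutation is the one that keeps each speaker's frames on a single output layer, which is what resolves label permutation and speaker tracing together.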
Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning
Self-supervised learning (SSL) of speech has shown impressive results in
speech-related tasks, particularly in automatic speech recognition (ASR). While
most methods employ the output of intermediate layers of the SSL model as
real-valued features for downstream tasks, there is potential in exploring
alternative approaches that use discretized token sequences. This approach
offers benefits such as lower storage requirements and the ability to apply
techniques from natural language processing. In this paper, we propose a new
protocol that utilizes discretized token sequences in ASR tasks; it includes
de-duplication and sub-word modeling to enhance the input sequence and reduces
computational cost by shortening it. Our experiments on
the LibriSpeech dataset demonstrate that our proposed protocol performs
competitively with conventional ASR systems using continuous input features,
while reducing computational and storage costs.
Comment: Accepted at INTERSPEECH 202
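As a rough illustration of the two sequence-shortening steps named above, here is a minimal Python sketch; the SentencePiece usage at the end is an assumption for illustration, not necessarily the paper's exact pipeline.

    # De-duplication collapses runs of identical discrete SSL units;
    # sub-word modeling then merges frequent unit n-grams into single tokens.
    from itertools import groupby

    def deduplicate(tokens):
        """Collapse consecutive repeats, e.g. [5, 5, 5, 9, 9, 5] -> [5, 9, 5]."""
        return [tok for tok, _ in groupby(tokens)]

    tokens = [5, 5, 5, 9, 9, 5, 12, 12]
    dedup = deduplicate(tokens)  # [5, 9, 5, 12]

    # Sub-word modeling: treat cluster ids as "words" and let a BPE/unigram
    # model merge frequent sequences, shortening the input further. A
    # hypothetical SentencePiece model file is assumed here:
    # import sentencepiece as spm
    # sp = spm.SentencePieceProcessor(model_file="discrete_units.model")
    # pieces = sp.encode(" ".join(str(t) for t in dedup), out_type=str)

Both steps act before the ASR encoder, so the cost saving comes directly from the reduced input length.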
HuBERTopic: Enhancing Semantic Representation of HuBERT through Self-supervision Utilizing Topic Model
Recently, the usefulness of self-supervised representation learning (SSRL)
methods has been confirmed in various downstream tasks. Many of these models,
as exemplified by HuBERT and WavLM, use pseudo-labels generated from spectral
features or the model's own representation features. From previous studies, it
is known that the pseudo-labels contain semantic information. However, the
masked prediction task, the learning criterion of HuBERT, focuses on local
contextual information and may not make effective use of global semantic
information such as speaker identity or the theme of the speech. In this paper, we
propose a new approach to enrich the semantic representation of HuBERT. We
apply a topic model to the pseudo-labels to generate a topic label for each
utterance. An auxiliary topic classification task is added to HuBERT by using
topic labels as teachers. This allows additional global semantic information to
be incorporated in an unsupervised manner. Experimental results demonstrate
that our method achieves performance comparable to or better than the baseline
in most tasks, including automatic speech recognition and five out of the eight
SUPERB tasks. Moreover, we find that the topic labels carry various information
about an utterance, such as the speaker's gender and identity and the theme of
the utterance. This highlights the
effectiveness of our approach in capturing multifaceted semantic nuances.
Comment: Submitted to IEEE ICASSP 202
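To make the utterance-level topic labeling concrete, here is a minimal sketch assuming LDA as the topic model and treating each utterance's pseudo-label sequence as a bag of cluster ids; all names and sizes are illustrative, not the authors' code.

    # Derive one topic label per utterance from HuBERT-style pseudo-labels.
    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation

    num_clusters, num_topics = 100, 10
    rng = np.random.default_rng(0)
    # pseudo_labels[i] stands in for the pseudo-label sequence of utterance i.
    pseudo_labels = [rng.integers(0, num_clusters, size=250) for _ in range(32)]

    # Bag-of-clusters count matrix: (utterances, cluster vocabulary).
    counts = np.stack(
        [np.bincount(seq, minlength=num_clusters) for seq in pseudo_labels]
    )

    lda = LatentDirichletAllocation(n_components=num_topics, random_state=0)
    topic_posteriors = lda.fit_transform(counts)    # (utterances, topics)
    topic_labels = topic_posteriors.argmax(axis=1)  # teachers for the aux task

The resulting topic_labels would then serve as targets of an auxiliary classification head added on top of HuBERT, alongside the masked prediction objective.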
VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition/Synthesis and Speech/Text Continuation Tasks
We propose a decoder-only language model, VoxtLM, that can perform
four tasks: speech recognition, speech synthesis, text generation, and speech
continuation. VoxtLM integrates text vocabulary with discrete speech tokens
from self-supervised speech features and uses special tokens to enable
multitask learning. Compared to a single-task model, VoxtLM exhibits a
significant improvement in speech synthesis, with speech intelligibility
improving from 28.9 to 5.6 (lower is better) and objective quality from 2.68 to 3.90.
VoxtLM also improves speech generation and speech recognition performance over
the single-task counterpart. VoxtLM is trained with publicly available data,
and the training recipes and model checkpoints will be open-sourced to make the
work fully reproducible.
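As an illustration of how special tokens can route a single decoder-only vocabulary across the four tasks, here is a minimal sketch; the token names and sequence layout are assumptions for illustration, not VoxtLM's published format.

    # One flat token stream per task, selected by a leading task token.
    SPECIALS = ["<asr>", "<tts>", "<textlm>", "<speechlm>", "<sep>", "<eos>"]

    def build_sequence(task, speech_tokens=None, text_tokens=None):
        """Concatenate task tag, source, and target into one token stream."""
        if task == "asr":        # speech tokens in, text out
            return ["<asr>"] + speech_tokens + ["<sep>"] + text_tokens + ["<eos>"]
        if task == "tts":        # text in, speech tokens out
            return ["<tts>"] + text_tokens + ["<sep>"] + speech_tokens + ["<eos>"]
        if task == "textlm":     # text continuation
            return ["<textlm>"] + text_tokens + ["<eos>"]
        if task == "speechlm":   # speech continuation
            return ["<speechlm>"] + speech_tokens + ["<eos>"]
        raise ValueError(task)

    seq = build_sequence("asr", speech_tokens=["s12", "s7", "s7"],
                         text_tokens=["hello"])
    # ['<asr>', 's12', 's7', 's7', '<sep>', 'hello', '<eos>']

Because text and discrete speech tokens share one vocabulary, a single next-token objective covers all four tasks.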
TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition
We present TokenSplit, a speech separation model that acts on discrete token
sequences. The model is trained on multiple tasks simultaneously: separate and
transcribe each speech source, and generate speech from text. The model
operates on transcripts and audio token sequences and achieves multiple tasks
through masking of inputs. It is a sequence-to-sequence encoder-decoder based
on the Transformer architecture. We also present a "refinement"
version of the model that predicts enhanced audio tokens from the audio tokens
of speech separated by a conventional separation model. Using both objective
metrics and subjective MUSHRA listening tests, we show that our model achieves
excellent performance in terms of separation, both with and without transcript
conditioning. We also measure the automatic speech recognition (ASR)
performance and provide audio samples of speech synthesis to demonstrate the
additional utility of our model.
Comment: INTERSPEECH 2023, project webpage with audio demos at https://google-research.github.io/sound-separation/papers/tokenspli
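As a rough illustration of how masking input fields can select among the training tasks, here is a minimal sketch; the field layout, task names, and mask token are assumptions, not TokenSplit's exact scheme.

    # One encoder-decoder model; masking different input fields picks the task.
    MASK = "<mask>"

    def make_example(mix_tokens, transcripts, src_tokens, task):
        """Build (encoder_input, decoder_target) for one training task."""
        if task == "separate":       # mixture audio in -> sources + text out
            enc = mix_tokens + [MASK]
            dec = src_tokens + transcripts
        elif task == "conditioned":  # transcripts given -> separation only
            enc = mix_tokens + transcripts
            dec = src_tokens
        elif task == "tts":          # text in -> speech tokens out
            enc = [MASK] + transcripts
            dec = src_tokens
        else:
            raise ValueError(task)
        return enc, dec

    enc, dec = make_example(["a1", "a2"], ["hi", "there"], ["s1", "s2"],
                            "separate")
    # enc: ['a1', 'a2', '<mask>']   dec: ['s1', 's2', 'hi', 'there']

Training on all three variants of the same example is what lets one model separate, transcribe, and synthesize, with or without transcript conditioning.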